Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages
Authors
Abstract
Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from applying different tokenization schemes to different words within the same text, and from varying those schemes across target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language tokenization schemes, and that a context-variable tokenization scheme can outperform a context-constant one, with a statistically significant improvement of about 1.4 BLEU points.
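The abstract describes choosing a tokenization scheme per word rather than per corpus. As a rough illustration only, the Python sketch below shows what such a context-variable tokenizer might look like; the scheme names, the clitic-splitting heuristic, and the per-word selector are all hypothetical stand-ins (a real system would use a morphological analyzer and a trained classifier), not the authors' implementation.

# Illustrative sketch, not the authors' system: apply a possibly different
# tokenization scheme to each source word ("context-variable"), instead of
# one fixed scheme for the whole corpus ("context-constant").

def split_clitics(word, level):
    # Hypothetical clitic splitter; a real setup would use a morphological
    # analyzer. It detaches up to `level` common Arabic proclitics.
    proclitics = ["w", "f", "b", "l"][:level]
    if len(word) > 3 and word[0] in proclitics:
        return [word[0] + "+", word[1:]]
    return [word]

SCHEMES = {
    "D0": lambda w: [w],                        # no segmentation
    "D2": lambda w: split_clitics(w, level=2),  # split conjunction/particle clitics
    "ATB": lambda w: split_clitics(w, level=4), # Arabic Treebank-style split
}

def tokenize_context_variable(sentence, choose_scheme):
    # choose_scheme(word) stands in for a per-word, per-target-language
    # classifier; here each word may receive a different scheme.
    tokens = []
    for word in sentence.split():
        tokens.extend(SCHEMES[choose_scheme(word)](word))
    return tokens

# Toy selector standing in for a learned model (Buckwalter-style strings).
print(tokenize_context_variable(
    "wktb Alwld Aldrs",
    lambda w: "ATB" if w.startswith("w") else "D0"))
# -> ['w+', 'ktb', 'Alwld', 'Aldrs']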
Similar Papers
Bayesian Learning of Tokenization for Machine Translation
Training a statistical machine translation system starts with tokenizing a parallel corpus. Some languages such as Chinese do not incorporate spacing in their writing system, which creates a challenge for tokenization. Morphologically rich languages such as Korean and Hungarian present an even bigger challenge, since optimal token boundaries for machine translation in these languages are often ...
Unsupervised Tokenization for Machine Translation
Training a statistical machine translation system starts with tokenizing a parallel corpus. Some languages such as Chinese do not incorporate spacing in their writing system, which creates a challenge for tokenization. Moreover, morphologically rich languages such as Korean present an even bigger challenge, since optimal token boundaries for machine translation in these languages are often unclear. Bo...
Techniques for Arabic Morphological Detokenization and Orthographic Denormalization
The common wisdom in the field of Natural Language Processing (NLP) is that orthographic normalization and morphological tokenization help in many NLP applications for morphologically rich languages like Arabic. However, when Arabic is the target language, the output should be properly detokenized and orthographically correct. We examine a set of six detokenization techniques over various tokenization sc...
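To illustrate the detokenization step this entry is about, here is a minimal, assumption-laden sketch: it reattaches segments marked with "+" (a common convention for Arabic tokenized text) using two regular-expression rules. It is a naive baseline for illustration, not one of the six techniques the paper examines, which also handle orthographic denormalization.

import re

def detokenize(tokens):
    # Naive rule-based detokenization: reattach '+'-marked clitic segments,
    # e.g. ['w+', 'ktb', '+hA'] -> 'wktbhA'.
    text = " ".join(tokens)
    text = re.sub(r"\+ ", "", text)  # attach proclitics: "w+ ktb" -> "wktb"
    text = re.sub(r" \+", "", text)  # attach enclitics:  "ktb +hA" -> "ktbhA"
    return text

print(detokenize(["w+", "ktb", "+hA"]))  # -> wktbhA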
Improving Patent Translation using Bilingual Term Extraction and Re-tokenization for Chinese-Japanese
Unlike European languages, many Asian languages such as Chinese and Japanese do not mark word boundaries in their writing systems. Word segmentation (tokenization), which breaks sentences down into individual words (tokens), is normally treated as the first step of machine translation (MT). For Chinese and Japanese, different rules and segmentation tools lead to different segmentation results in differe...
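The segmentation ambiguity described here is easy to demonstrate. The following self-contained sketch implements greedy longest-match segmentation (a classic CJK baseline, used purely for illustration) and shows how two different dictionaries yield two different tokenizations of the same Chinese sentence:

def max_match(text, vocab):
    # Greedy longest-match segmentation: at each position take the longest
    # dictionary entry; fall back to a single character for unknown strings.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

sentence = "研究生命起源"  # "study the origin of life"
print(max_match(sentence, {"研究", "生命", "起源"}))  # ['研究', '生命', '起源']
print(max_match(sentence, {"研究生", "起源"}))        # ['研究生', '命', '起源']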
Lattice-Based Recurrent Neural Network Encoders for Neural Machine Translation
Neural machine translation (NMT) heavily relies on word-level modelling to learn semantic representations of input sentences. However, for languages without natural word delimiters (e.g., Chinese) where input sentences have to be tokenized first, conventional NMT is confronted with two issues: 1) it is difficult to find an optimal tokenization granularity for source sentence modelling, and 2) e...
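The lattice idea can be sketched independently of the neural encoder: merge several tokenizations of one sentence into a directed graph over character offsets, so that every segmentation is a path through the graph. The sketch below is an assumption-level illustration, not the paper's code; the recurrent encoder that consumes the lattice is omitted.

from collections import defaultdict

def build_lattice(sentence, tokenizations):
    # Nodes are character offsets 0..len(sentence); each token becomes an
    # edge from its start offset to its end offset. Shared tokens merge.
    edges = defaultdict(set)
    for tokens in tokenizations:
        pos = 0
        for tok in tokens:
            edges[pos].add((tok, pos + len(tok)))
            pos += len(tok)
        assert pos == len(sentence), "tokenization must cover the sentence"
    return dict(edges)

lattice = build_lattice("研究生命起源", [
    ["研究", "生命", "起源"],   # segmentation from tool A
    ["研究生", "命", "起源"],   # segmentation from tool B
])
for start in sorted(lattice):
    print(start, sorted(lattice[start]))
# 0 [('研究', 2), ('研究生', 3)]
# 2 [('生命', 4)]
# 3 [('命', 4)]
# 4 [('起源', 6)]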
Journal title:
Volume Issue
Pages -
Publication date: 2017